Compiler program, compiling method, information processing device

ABSTRACT

A compiler program causes a computer to execute optimization processing for an optimization target program. The optimization target program includes a loop including a vector store instruction and a vector load instruction for an array variable. The optimization processing includes (1) unrolling the vector store instruction and the vector load instruction in the loop by an unrolling number of times to generate a plurality of unrolled vector store instructions and a plurality of unrolled vector load instructions, and (2) scheduling to move an unrolled vector load instruction among the plurality of unrolled vector load instructions, which is located after a first unrolled vector store instruction that is located first among the plurality of unrolled vector store instructions, to a position before the first unrolled vector store instruction.

CROSS-REFERENCE TO RELATED APPLICATION

This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2020-100151, filed on Jun. 9, 2020, the entire contents of which are incorporated herein by reference.

FIELD

The present invention relates to a compiler program, a compiling method, and an information processing device.

BACKGROUND

A compiler is a program that translates a program of a high-level programming language, e.g., Fortran or the C programming language, into an assembly code or object code (machine language). The main functions of the compiler include, for instance, an analysis function of analyzing a source code of a high-level programming language, a function of generating an intermediate code of an intermediate language from the source code, a function of optimizing the intermediate code, and a code generation function of converting the optimized intermediate code into an assembly code or object code. The optimization function, for instance, converts a plurality of scalar instructions into a vector instruction (or SIMD (Single Instruction Multiple Data) instruction, which is hereinafter referred to as “vector instruction”), unrolls a loop (loop unrolling), or performs scheduling of changing the order of instructions.

Japanese Patent Application Laid-open No. 2016-212573 and Japanese Patent Application Laid-open No. H09-160784 describe a compiler.

SUMMARY

A vector instruction loads, from consecutive addresses in a memory, a number of array elements commensurate with a vector length (hereinafter “vector length array elements” or “vector length elements”) from the first array element of the array to be processed by the vector instruction, stores the array elements into a vector register, and processes the vector length array elements by a single instruction code. A vector instruction with a mask register accesses an element of an array in a memory on the basis of a mask bit indicating whether the element is an element to be accessed among the vector length array elements.

Meanwhile, when a sequence of instructions is decoded, a microprocessor checks whether memory addresses to be accessed by a store instruction and memory addresses to be accessed by a subsequent load instruction overlap with each other, and when the result of checking is true, the microprocessor executes the instructions in accordance with the order of a program code, namely, in accordance with such a rule that the load instruction is executed after the store instruction is executed. This phenomenon, in which the load instruction is not executed until the store instruction immediately before the load instruction is finished, is referred to as “Store Fetch Interlock (SFI)”.

When some of the mask bits of a mask register are TRUE (to be accessed) and the remaining mask bits are FALSE (not to be accessed), the above-mentioned vector instruction with a mask register accesses only a part of the vector length array elements. Thus, for the vector instruction with a mask register, occurrence of original SFI has to be determined based on the mask bits of the respective mask registers of the consecutive store instruction and load instruction.

However, some microprocessors determine occurrence of SFI for a vector instruction on the basis of whether memory addresses of pieces of data of the vector length elements from the first array element to be processed by the consecutive vector store instruction and vector load instruction overlap with each other. In this case, when the memory addresses of the vector length array elements overlap with each other, but the memory addresses of array elements to be accessed by the vector instructions on the basis of the mask registers do not overlap with each other, the original SFI does not occur, but SFI that depends on architecture of the microprocessor occurs. When the vector length of the microprocessor becomes greater, the length of an array to be processed by the vector instruction also becomes greater, which extends the area of memory addresses of the vector length array elements. As a result, the frequency of occurrence of SFI that depends on the above-mentioned architecture increases.

When the microprocessor executes the consecutive store instruction and load instruction by in-order execution based on the SFI that depends on the architecture, the period of time for executing the instructions becomes longer than in the case of executing the consecutive store instruction and load instruction by out-of-order execution. The SFI that depends on the architecture is not original SFI, and the consecutive store instruction and load instruction do not have to be executed by in-order execution. Therefore, it is desirable to execute the consecutive store instruction and load instruction by out-of-order execution to reduce the execution period.

A first aspect of an embodiment is a non-transitory computer readable storage medium that stores therein a compiler program causing a computer to execute optimization processing for an optimization target program, the optimization processing comprising: the optimization target program including a loop including a vector store instruction and a vector load instruction for an array variable, unrolling the vector store instruction and the vector load instruction in the loop by an unrolling number of times to generate a plurality of unrolled vector store instructions and a plurality of unrolled vector load instructions, the unrolling number of times including a first unrolling number of times, which is obtained by dividing a vector length by an array size of the array variable and rounding up a remainder, or a second unrolling number of times lower than the first unrolling number of times by one; and scheduling to move an unrolled vector load instruction among the plurality of unrolled vector load instructions, which is located after a first unrolled vector store instruction that is located first among the plurality of unrolled vector store instructions, to a position before the first unrolled vector store instruction.

According to the first aspect, a processor that executes a compiler program optimizes a vector instruction so as to avoid determination of SFI that depends on the architecture.

The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a diagram illustrating an exemplary configuration of an information processing device that executes a compiler.

FIG. 2 is a diagram illustrating a flow chart of the compiling processing by the compiler.

FIG. 3 is a diagram illustrating a detailed flow chart of the optimization processing S102 by the compiler.

FIG. 4 is a diagram illustrating an example of occurrence of SFI_2.

FIG. 5 is a diagram illustrating an exemplary relationship among a position of an element of an array of the vector load instruction with the mask register in the memory, the vector register, and the mask register.

FIG. 6 is a diagram illustrating a method of avoiding SFI_2 by conversion into a single loop.

FIG. 7 is a diagram illustrating a method of avoiding SFI_2 by unpacking.

FIG. 8 is a diagram illustrating an access address relationship between the vector load instruction vload (row 05) and the vector store instruction vstore (row 07) of the assembly pseudo code C22.

FIG. 9 is a diagram illustrating an access address relationship between the vector load instruction vload (row 05) and the vector store instruction vstore (row 07) of the assembly pseudo code C23.

FIG. 10 is a diagram illustrating an example in which SFI_2 cannot be avoided by conversion into a single loop or unpacking.

FIG. 11 is a diagram illustrating optimization processing (1) of avoiding SFI_2 in the embodiment.

FIG. 12 is a diagram illustrating loop unrolling and scheduling of optimization processing of avoiding SFI_2 in this embodiment.

FIG. 13 is a diagram illustrating a relationship between the array size n and the value of the ceil function ceil(VL/n) in the case of vector length VL=16.

FIG. 14 is a diagram illustrating optimization processing (2) of avoiding SFI_2 in the embodiment.

FIG. 15 is a diagram illustrating loop unrolling and scheduling of optimization processing of avoiding SFI_2 in this embodiment.

FIG. 16 is a diagram illustrating loop unrolling and scheduling in the case of a source code in which the loop includes a load instruction before a store instruction.

FIG. 17 is a diagram illustrating loop unrolling and scheduling in the case of a source code in which the loop includes a store instruction before a load instruction.

FIG. 18 is a diagram illustrating an example of a source code including the loop to be optimized and an intermediate pseudo code after vectorization in the second embodiment.

FIG. 19 is a diagram illustrating an intermediate pseudo code C42 having branch structure in which the above-mentioned loop (rows 03 to 07) is converted into case-specific loops.

FIG. 20 is a diagram illustrating an intermediate pseudo code C43 obtained by loop unrolling each case-specific loop of the intermediate pseudo code C42, which has branch structure converted into the case-specific loops of FIG. 19, by the corresponding number of times of unrolling.

FIG. 21 is a diagram illustrating an example of a source code including the loop to be optimized and an intermediate pseudo code after vectorization in the second embodiment.

FIG. 22 is a diagram illustrating an intermediate pseudo code C52 having branch structure in which the innermost loop (rows 03 to 07) of the vectorized intermediate pseudo code C51 is converted into case-specific loops.

FIG. 23 is a diagram illustrating an intermediate pseudo code C53 obtained by unrolling each case-specific loop of the intermediate pseudo code C52 having branch structure in which the vectorized innermost loop is converted into the case-specific loops of FIG. 22 by the corresponding number of times of unrolling.

FIG. 24 is a diagram illustrating a flow chart of overall optimization processing of the compiler.

FIG. 25 is a diagram illustrating an example of the loop structure.

FIG. 26 is a diagram illustrating a flow chart of the processing of loop structure analysis S1.

FIG. 27 and FIG. 28 are diagrams illustrating flow charts of the determination processing S20 of SFI_2.

FIG. 29 and FIG. 30 are diagrams illustrating flow charts of the processing S40 of determining a relationship between a vector store instruction and a vector load instruction in the same loop and in preceding and subsequent loops.

DESCRIPTION OF EMBODIMENTS

Definition of Terminology

In this specification, original SFI, which occurs when memory addresses of array elements to be accessed respectively by a preceding vector store instruction and a subsequent vector load instruction overlap with each other, is referred to as “primary SFI” or “SFI_1”. In addition, SFI that depends on architecture of a microprocessor occurs when memory addresses of array elements to be accessed respectively by a preceding vector store instruction and a subsequent vector load instruction based on mask registers do not overlap with each other (primary SFI does not occur) but memory addresses of the vector length array elements of both vector instructions overlap with each other. SFI that depends on this microarchitecture is referred to as “secondary SFI” or “SFI_2”. In addition, “memory addresses of the vector length array elements from the first array element of a vector instruction” is hereinafter simply referred to as “memory addresses of the vector length array elements of a vector instruction”. The number of iterations of a loop is referred to as the “iteration count”. The number of times a loop is unrolled is referred to as the “number of times of unrolling”.

A compiler optimizes an intermediate code, which is generated from a source code written in a high-level programming language, and converts the intermediate code into an assembly code or object code. The intermediate code is described in various languages depending on the compiler. Thus, in this specification, “intermediate pseudocode”, which describes an intermediate code in a pseudo manner, is used to describe an embodiment. The intermediate pseudocode may be described in a code identical to the assembly code. In the same manner, “assembly pseudocode” is used as the assembly code.

Information Processing Device and Compiling Method that Execute Compiler

FIG. 1 is a diagram illustrating an exemplary configuration of an information processing device that executes a compiler. The information processing device 1 includes a central processing unit (CPU) 10 being a processor, a main memory 12 to be accessed by the processor, an external interface 14, a storage 20 being an auxiliary storage device, and an internal bus 16 connecting those components to one another.

The storage 20 stores a compiler 21, a source code 23 to be compiled, and an assembly code or object code 25 obtained by converting the source code by the compiler. Furthermore, during execution of the compiler, the main memory 12 stores a parameter 22 to be generated at the time of compiling by the compiler and an intermediate code 24 generated from the source code. The processor loads the compiler 21 into the memory 12, and executes the loaded compiler. The source code is a code of a high-level programming language, e.g., Fortran, the C programming language, or C++. The intermediate code is a program code generated from the source code by the processor that executes the compiler, and is subjected to optimization processing of the compiling processing.

FIG. 2 is a diagram illustrating a flow chart of the compiling processing by the compiler. The processor executes the compiler to execute the following processing. The processor analyzes the source code 23 to be compiled (S100). Analysis of the source code includes, for instance, lexical analysis, syntactic analysis, and semantic analysis of the source code. Next, the processor generates the intermediate code 24 from the source code 23 based on the result of analyzing the source code (S101). Then, the processor executes the processing of optimizing the intermediate code to obtain an intermediate code changed so as to achieve reduction of the calculation period (S102). After the optimization processing, the processor converts the intermediate code into an assembly code or object code (S103).

FIG. 3 is a diagram illustrating a detailed flow chart of the optimization processing S102 by the compiler. The processor executes the compiler to execute the optimization processing S102 illustrated in FIG. 3. The processor executes first optimization processing of the intermediate code (S110). The first optimization processing includes, for instance, converting multi-loop structure into a single loop. Then, the processor executes vectorization of replacing the same instruction code that processes a plurality of pieces of data of the intermediate code with a vector instruction (S111). Then, the processor analyzes the loop structure to detect loop structure to be optimized in this embodiment (S112). In addition, the processor executes loop unrolling (S112) and scheduling (S113) for the loop structure to be optimized. Lastly, the processor executes second optimization processing (S114).

The loop unrolling is processing of expanding (unrolling) an instruction in the loop and reducing the number of times of iteration. The scheduling is, for instance, processing of changing the order of instructions to enable the processor to execute the instructions by out-of-order execution.
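The two operations can be pictured with a minimal Python sketch. It is an illustration only, not the compiler's implementation: the `Instr` record, the function names `unroll` and `hoist_loads`, and the three-instruction loop body are hypothetical, and the sketch assumes that a dependence analysis has already shown that moving the loads is legal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    op: str         # "vload", "vadd", or "vstore" (hypothetical mnemonics)
    iteration: int  # original loop iteration this copy belongs to


def unroll(body, times):
    """Repeat the loop body `times` times, tagging each copy with its iteration."""
    return [Instr(op, k) for k in range(1, times + 1) for op in body]


def hoist_loads(instrs):
    """Move every vload located after the first vstore to just before that store.

    Assumption: the caller has already verified that the move is legal.
    """
    stores = [i for i, ins in enumerate(instrs) if ins.op == "vstore"]
    if not stores:
        return list(instrs)
    first = stores[0]
    head, tail = instrs[:first], instrs[first:]
    moved = [ins for ins in tail if ins.op == "vload"]
    kept = [ins for ins in tail if ins.op != "vload"]
    return head + moved + kept


unrolled = unroll(["vload", "vadd", "vstore"], 2)
scheduled = hoist_loads(unrolled)
print([f"{ins.op}{ins.iteration}" for ins in scheduled])
# ['vload1', 'vadd1', 'vload2', 'vstore1', 'vadd2', 'vstore2']
```

After scheduling, no load follows the first store, so the store and the load from the next iteration are no longer an adjacent store-then-load pair.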

SFI_2

FIG. 4 is a diagram illustrating an example of occurrence of SFI_2. A code C1 is an example of the source code. In the source code C1, arrays a(8,32) and b(8,32) are declared as real numbers (4) of 4-byte single precision (row 02), and a primary loop (rows 04 to 06) is included in a secondary loop (rows 03 to 07). In the innermost (primary) loop (rows 04 to 06), a calculation instruction of adding the arrays a(j,i) and b(j,i) to each other and inputting the sum into the array a(j,i) is described (row 05). Furthermore, when the size of the vector register is set as 512 bit, and the size of an element of the array is set as 4 byte of single precision as described above, a vector length VL is given below.

Vector length VL = the size (512 bit) of the vector register / the size (4 byte) of an element of the array = 16.

Namely, the vector length VL is the maximum number of elements of the array that can be processed by a vector instruction. When the size of the vector register is 512 bit, and the element size of the array is 8 byte of double precision, the vector length VL is 8.
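The calculation above can be checked with a short Python sketch. The function name `vector_length` is a hypothetical helper introduced only for this illustration; the one point worth making explicit is that the register size is given in bits and the element size in bytes.

```python
def vector_length(register_bits: int, element_bytes: int) -> int:
    """Number of array elements that fit into one vector register."""
    return register_bits // (element_bytes * 8)  # 8 bits per byte


print(vector_length(512, 4))  # single precision: 16
print(vector_length(512, 8))  # double precision: 8
```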

When the calculation instruction of the innermost loop of the source code C1 is vectorized, a vectorized intermediate pseudocode C2 is generated. In the intermediate pseudocode C2, the calculation instruction of the innermost loop of the source code C1 is vectorized to obtain a calculation instruction (row 04) changed so as to add the arrays a(1:8,i) and b(1:8,i) to each other and input the sum into the array a(1:8,i). The intermediate pseudocode C2 is represented in a pseudo manner by a code identical to the source code C1.

Then, when the code C2 is converted into an assembly code, an assembly pseudocode C3 is obtained. Only the major instructions are extracted and are described in the assembly pseudocode C3. The loop (rows 03 to 05) of the code C2 is converted into a loop (rows 03 to 09) including an assembly code of a load instruction, an add instruction, a store instruction, and a subtraction instruction in the code C3. The codes of the row 02, the rows 04 to 07, and the row 08 have the following meanings.

row 02: A mask register “predicate_true ps0” has 4-byte (ps: predicate single) elements, and the first eight elements are TRUE “VL8” and the remaining eight elements are FALSE. A specific mask register MASK is described on the right side of the row 02.

row 04: A vector load instruction vload with a mask register ps0 that loads pieces of data of first to eighth elements of an array b(1:8,i) from the memory into a vector register vs2. The vector register vs2 has 4-byte single-precision elements (vs: vector single).

row 05: A vector load instruction vload with the mask register ps0 that loads pieces of data of first to eighth elements of an array a(1:8,i) from the memory into a vector register vs1. The vector register vs1 also has 4-byte single-precision array elements.

row 06: Add the vector registers vs1 and vs2, and store the sum into vs1.

row 07: A vector store instruction vstore with the mask register ps0 that stores data of the vector register vs1 into the array a(1:8,i) in the memory.

row 08: Subtract 1 from the value of a register d2 storing a control variable i of the loop to −1.

FIG. 5 is a diagram illustrating an exemplary relationship among a position of an element of an array of the vector load instruction with the mask register in the memory, the vector register, and the mask register. In the same manner as in FIG. 4, the vector register size is 512 bit, the element size of the array is 4 byte, and the vector length VL is 16. Pieces of data of respective elements of the array a(8,32) are stored in consecutive addresses of the memory in order of a(1,1) to a(8,1) and a(1,2) to a(8,2). When the vector register size is 512 bit and the size of one element of the array is 4 byte (single precision), the vector length is 16 as described above, and the sixteen array elements are stored in the memory using the capacity of 16 × 4 byte × 8 bit = 512 bit. In FIG. 5, the arrow of VL indicates an address area of VL array elements from the first array element in the memory.
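The storage order described above (a(1,1) to a(8,1), then a(1,2) to a(8,2)) corresponds to the following address arithmetic. This is a small sketch for illustration; the function name `byte_offset` and the choice to measure offsets from the address of a(1,1) are assumptions made here.

```python
def byte_offset(j: int, i: int, rows: int, element_bytes: int = 4) -> int:
    """Byte offset of a(j, i) from a(1, 1) when the first index varies fastest."""
    return ((i - 1) * rows + (j - 1)) * element_bytes


print(byte_offset(1, 1, rows=8))  # 0
print(byte_offset(8, 1, rows=8))  # 28
print(byte_offset(1, 2, rows=8))  # 32, directly after a(8,1)

# Sixteen 4-byte elements span 64 bytes, i.e. 512 bits (one vector register).
print(16 * 4 * 8)  # 512
```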

Then, in the vector load instruction, the processor accesses pieces of data of sixteen array elements in the memory to store the sixteen elements into the vector register. In contrast, in the vector store instruction, the processor stores sixteen array elements stored in the 512-bit vector register into the memory with the capacity of 512 bit.

In the top vector load instruction vload with the mask register of FIG. 5, the former eight elements of the mask register MASK are TRUE (T) and the latter eight elements of the mask register MASK are FALSE (F). When the vector load instruction with this mask register is executed, the processor loads eight elements a(1,1) to a(8,1) of the array a(8,32) in the memory into the first eight elements of the vector register based on the mask register. In this case, the processor accesses the address of the first element a(1,1) to the address of the eighth element a(8,1) in the memory, which are to be processed by the vector load instruction, and stores the loaded pieces of data of the eight elements into the first eight elements of the vector register.

In the second vector load instruction vload with the mask register, among sixteen elements of the mask register, the former four elements of the mask register are TRUE (T), and the remaining twelve elements of the mask register are FALSE (F). In this case, the processor accesses the addresses of the four elements a(1,1) to a(4,1) of the array a(8,32) in the memory based on the mask register, and loads the read pieces of data of the four elements into the first four elements of the vector register.

In this manner, when the vector instruction with the mask register is executed, the processor accesses a part of the elements among the vector length VL elements in the memory, and stores the part of the elements into the vector register or stores the part of the elements of the vector register into the memory. The mask register is usually generated, as in the code C3, when the compiler generates an assembly code or object code. The compiler generates a mask register when the vector instruction does not process all the vector length elements.
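The behavior of a masked load can be modeled in a few lines of Python. This is a simplified model, not the instruction's definition: the function name `masked_vload` is hypothetical, memory is modeled as a list of element values, and lanes whose mask bit is FALSE are represented by `None` because the text does not state what such lanes contain.

```python
def masked_vload(memory, start, mask):
    """Load memory[start + k] into lane k only where mask[k] is True."""
    return [memory[start + k] if bit else None for k, bit in enumerate(mask)]


memory = list(range(100, 132))      # stand-in values for a(1,1), a(2,1), ...
mask8 = [True] * 8 + [False] * 8    # first eight lanes TRUE, VL = 16
mask4 = [True] * 4 + [False] * 12   # first four lanes TRUE

print(masked_vload(memory, 0, mask8)[:8])  # [100, 101, ..., 107]
print(masked_vload(memory, 0, mask4)[:4])  # [100, 101, 102, 103]
```

Only the elements whose mask bit is TRUE are read, which is why the addresses actually accessed can be a strict subset of the VL-element address area.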

Referring back to FIG. 4, when the processor executes the code C3, as illustrated at the bottom of FIG. 4, the memory addresses of elements (elements for which the mask register is TRUE) to be accessed by the vector store instruction vstore1 in the first iteration and the vector load instruction vload2 in the second iteration respectively based on the mask register do not overlap with each other. However, the memory addresses (memory addresses indicated by arrow VL) of the vector length VL array elements to be accessed by the vector instructions vstore1 and vload2 partially overlap with each other. In this case, the memory addresses of array elements to be accessed by both the vector instructions based on the mask register do not overlap with each other, and thus SFI_1 does not occur, whereas the memory addresses of the vector length VL array elements of both the vector instructions partially overlap with each other, and thus SFI_2 occurs.
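The distinction between SFI_1 and SFI_2 can be expressed as two interval tests. The sketch below is a simplified model of the check described in the text, not the processor's actual logic: the function names are hypothetical, positions are counted in array elements from a(1,1), and the mask is assumed to have its TRUE bits in the leading `active` lanes.

```python
def ranges_overlap(a_start, a_len, b_start, b_len):
    """True if [a_start, a_start + a_len) and [b_start, b_start + b_len) intersect."""
    return a_start < b_start + b_len and b_start < a_start + a_len


def classify_sfi(store_start, load_start, active, vl):
    """Classify a store followed by a load, both masked to `active` leading lanes."""
    if ranges_overlap(store_start, active, load_start, active):
        return "SFI_1"  # the elements actually accessed overlap
    if ranges_overlap(store_start, vl, load_start, vl):
        return "SFI_2"  # only the VL-element address areas overlap
    return "none"


# FIG. 4 case: vstore1 starts at a(1,1), vload2 at a(1,2), 8 of 16 lanes TRUE.
print(classify_sfi(store_start=0, load_start=8, active=8, vl=16))  # SFI_2
```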

Thus, when the decoder of the processor analyzes the code C3, the decoder determines that SFI_2 occurs, and the adjacent vector instructions vstore1 and vload2 are executed by in-order execution. As a result, the processor does not execute those instructions by out-of-order execution, which increases the processing time.

Avoidance of SFI_2 by Conversion into Single Loop

FIG. 6 is a diagram illustrating a method of avoiding SFI_2 by conversion into a single loop. The source code C10 of FIG. 6 is the same as the source code C1 of FIG. 4. In FIG. 4, the innermost loop iterates by using the control variable j=1 to 8 for the vector length VL=16. Thus, when an instruction in the innermost loop is converted into a vector instruction, the vector instruction accesses pieces of data of the first eight elements among the vector length 16 array elements. As a result, the addresses of elements to be accessed by the former vector store instruction vstore1 and the latter vector load instruction vload2 based on each mask do not overlap with each other. However, the addresses of the vector length VL=16 elements overlap with each other. Therefore, SFI_1 has not occurred but SFI_2 has occurred.

In contrast, in the example of FIG. 6, the processor that executes the compiler generates an intermediate code C12 that is changed to a single loop by deleting the innermost loop (rows 04 to 06) of the source code C10, and further generates an intermediate pseudo code C13 converted into a vector instruction (row 04) that accesses the vector length 16 array elements. The vector instruction (row 04) of the intermediate pseudo code C13 is as follows.

a(1:16,i)=a(1:16,i)+b(1:16,i)

That is, the number of pieces of data (number of array elements) of the vector instruction is equal to the vector length VL=16, and the control variable i for the loop is changed to be a multiple of 16 within 1 to 256.

The processor that executes the compiler generates an assembly pseudo code C14 based on the intermediate pseudo code C13. In this assembly pseudo code C14, the mask register ps0 is TRUE for all the vector length 16 array elements. As a result, in the case of the assembly pseudo code C14, the addresses of sixteen array elements to be accessed by the vector store instruction vstore1 and the vector load instruction vload2 of the array a(1:16,i) do not overlap with each other. That is, neither SFI_1 nor SFI_2 occurs.

In this manner, when the array size 8 of the variable a(8,32) of the source code C10 and the iteration count 8 (j=1-8) of the innermost loop (rows 04 to 06) match each other, a store instruction and a load instruction, which are consecutive across the loop, access pieces of element data of consecutive addresses on the memory, and thus the innermost loop is deleted to obtain a single loop so that the store instruction and the load instruction access consecutive addresses. That is, when both the instructions are converted into vector instructions for elements with the vector length 16, SFI_1 does not occur and SFI_2 can be avoided. It is indicated below the assembly pseudo code C14 of FIG. 6 that the addresses of the vector length 16 array elements to be accessed by the vector store instruction vstore1 in the first iteration and the vector load instruction vload2 in the second iteration respectively do not overlap with each other.
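The condition and the resulting access pattern can be sketched as follows. The function names `can_collapse` and `collapsed_windows` are hypothetical, positions are counted in array elements from a(1,1), and the sketch assumes that the total number of elements is a multiple of the vector length, as it is for a(8,32) with VL=16.

```python
def can_collapse(array_size: int, iteration_count: int) -> bool:
    """The single-loop conversion applies when the inner loop covers the whole first dimension."""
    return array_size == iteration_count


def collapsed_windows(array_size: int, columns: int, vl: int):
    """Element ranges [start, end) touched by each full-length vector instruction."""
    total = array_size * columns
    return [(start, start + vl) for start in range(0, total, vl)]


print(can_collapse(8, 8))   # True: a(8,32) with j = 1 to 8
print(can_collapse(9, 8))   # False: a(9,32) with j = 1 to 8

windows = collapsed_windows(8, 32, 16)
print(len(windows))         # 16 iterations of the single loop
print(windows[:2])          # [(0, 16), (16, 32)]
```

Because each window begins exactly where the previous one ends, the address area of a store never reaches into the area of the following load.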

Avoidance of SFI_2 by Unpacking

FIG. 7 is a diagram illustrating a method of avoiding SFI_2 by unpacking. In the source code C20 of FIG. 7, the innermost loop (rows 04 to 06) has the iteration count 8 (j=1-8) for array variables a(9,32) and b(9,32). That is, the array size n of the array variable a(9,32) is n=9, whereas the iteration count m in the innermost loop is m=8, which are different from each other. In this case, the method of changing to a single loop as shown in FIG. 6 cannot be used due to n=9, m=8 as follows.

When an instruction in the innermost loop of the source code C20 is vectorized, the intermediate pseudocode C21 is generated. Furthermore, when the intermediate pseudocode C21 is converted into an assembly code, the assembly pseudo code C22 is generated. In the assembly pseudo code C22, the mask register ps0 corresponds to a single-precision array, and eight elements are TRUE, whereas the remaining eight elements are FALSE within the vector length 16. An access address relationship between the vector load instruction vload (row 05) and the vector store instruction vstore (row 07) for the variable a of this assembly pseudo code C22 is as illustrated in FIG. 8.

FIG. 8 is a diagram illustrating an access address relationship between the vector load instruction vload (row 05) and the vector store instruction vstore (row 07) of the assembly pseudo code C22. FIG. 8 illustrates the mask register MASK of the vector store instruction vstore1 in the first iteration and array elements a(1,1)-a(9,1) and a(1,2)-a(7,2), and the mask register MASK of the vector load instruction vload2 in the second iteration and array elements a(1,2)-a(9,2) and a(1,3)-a(2,3) of the variable a on the memory. In this case, the iteration count of the loop is 8 for the array size 9, and the array element a(9,1) is not accessed. Thus, consecutive addresses are not accessed. In this case, the method of changing to a single loop cannot be used. As illustrated in the relationship between the processing order and the memory address of FIG. 8, the addresses (addresses of elements for which the mask register is TRUE) on the memory to be accessed by the vector store instruction vstore1 in the first iteration and the vector load instruction vload2 in the second iteration respectively do not overlap with each other, but the addresses of the vector length VL=16 elements of both the instructions overlap with each other. That is, SFI_1 does not occur, but SFI_2 may occur.

Referring back to FIG. 7, the processor that executes the compiler generates an assembly pseudo code C23 from the assembly pseudo code C22. Specifically, the processor that executes the compiler executes an unpacking instruction to change the size of one element of the instruction of the assembly pseudo code C22 from 4 byte to 8 byte, to thereby change a vector instruction with the vector length 16 to a vector instruction with the vector length 8.

“unpacked” of FIG. 5 illustrates a relationship between the array a on the memory and the vector register in an unpacking instruction. FIG. 5 is an example of the 4-byte single-precision vector length VL=16. As illustrated in FIG. 5, eight elements each having 4 byte are stored into a more significant bit side (or less significant bit side) of eight elements each having 8 byte in the vector register, and the less significant bit side (or more significant bit side) is set to 0, for instance. That is, on the assumption of VL=16, a packing instruction stores sixteen pieces of data into respective elements of 4 byte in the vector register, whereas an unpacking instruction stores eight pieces of data of 4 byte into respective elements of 8 byte. That is, the unpacking instruction stores eight pieces of data of 4 byte into the vector register at an interval of 8 byte.

In the assembly pseudo code C23 of FIG. 7, all the elements (eightelements) of the double-precision mask register pd0 (pd: precisiondouble) corresponding to an 8-byte double-precision element are TRUE,and the vector registers vd1 and vd2 of the vector instructions vloadand vstore with the array length 8 both have eight double-precision (vd:vector double) elements. Thus, the access address relationship betweenthe vector load instruction vload (row 05) and the vector storeinstruction vstore (row 07) for the variable a of the assembly pseudocode C23 is as illustrated in FIG. 9.

FIG. 9 is a diagram illustrating an access address relationship betweenthe vector load instruction vload (row 05) and the vector storeinstruction vstore (row 07) of the assembly pseudo code C23. FIG. 9illustrates the mask register MASK of the vector store instructionvstore1 in the first iteration and array elements a(1,1)-a(9,1) of thevariable a on the memory, and the mask register MASK of the vector loadinstruction vload2 in the second iteration and array elementsa(1,2)-a(2,2) of the variable a on the memory. The left and rightdirections correspond to memory addresses. The addresses of eightelements on the memory to be accessed by the vector store instructionvstore1 in the first iteration based on the mask register and theaddresses of eight elements on the memory to be accessed by the vectorload instruction vload2 in the second iteration based on the maskregister do not overlap with each other. This does not change the factthat SFI_1 does not occur. Furthermore, the addresses of the vectorlength 8 elements of the vector store instruction vstore1 in the firstiteration and the addresses of the vector length 8 elements of thevector load instruction vload2 in the second iteration do not alsooverlap with each other. With this, SFI_2 is avoided.

As illustrated in the relationship between the processing order on the vertical axis and the memory address on the horizontal axis of FIG. 9, the addresses of the vector length 8 array elements to be accessed by the vector store instruction vstore1 in the first iteration and the addresses of the vector length 8 array elements to be accessed by the vector load instruction vload2 in the second iteration do not overlap with each other, and thus it can be understood that SFI_2 can be avoided.

As described above, when the array size n and the iteration count m of the loop do not match each other (m<n), for instance, when the array size n is n=9 and the iteration count m is m=8, consecutive addresses cannot be accessed, and conversion into a single loop as shown in FIG. 6 cannot be used. However, when the iteration count m of the loop and the vector length VL have a specific ratio, for instance, 1:2, SFI_2 can be avoided by unpacking.

Example in which SFI_2 Cannot Be Avoided by Conversion into Single Loop or Unpacking

FIG. 10 is a diagram illustrating an example in which SFI_2 cannot be avoided by conversion into a single loop or unpacking. In the source code C30 illustrated in FIG. 10, the array size n of the array variable a(4,32) is n=4, and the iteration count m of the loop in the innermost loop (rows 04 to 06) is m=3, which satisfies m<n. Then, when a calculation instruction in the innermost loop of the source code C30 is vectorized, an intermediate pseudo code C31 is generated. This is an example of the innermost loop including a store instruction after a load instruction (store after load).

FIG. 10 illustrates the mask registers MASK of the vector load instruction vload1 and the vector store instruction vstore1 in the first iteration of a loop (rows 03 to 05) of the intermediate pseudo code C31 and arrays on the memory, and arrays on the memory of the vector load instructions vload2, vload3, and vload4 in the second to fourth iterations. Referring to FIG. 10, elements to be accessed by the vector store instruction vstore1 in the first iteration based on the mask register and elements to be accessed by the vector load instructions vload2 to vload4 in the second to fourth iterations based on the mask register do not overlap with each other. However, the memory addresses of the vector length 16 elements of the vector store instruction vstore1 in the first iteration and the memory addresses of the vector length 16 elements of the vector load instructions vload2 to vload4 in the second to fourth iterations overlap with each other. Thus, SFI_1 does not occur, but SFI_2 may occur.

As described above, in the source code C30, the array size n=4 of the array variable a(4,32) and the iteration count m=3 of the loop (same as the number of elements to be accessed by a vector instruction) in the innermost loop (rows 04 to 06) do not match each other, namely, m<n is satisfied. Therefore, the vector instruction does not access consecutive addresses when the innermost loop is repeated. Furthermore, the iteration count m=3 of the loop and the vector length VL=16 do not have a specific ratio that enables use of an unpacking instruction, e.g., 1:2 or 1:4. Thus, in the example of the source code C30 of FIG. 10, SFI_2 cannot be avoided by conversion into a single loop or unpacking described above.

First Embodiment: Example in which Array Size n of Variable, Iteration Count m of Loop, and Vector Length VL Are Constants

Example in which Loop Includes Store Command after Load Command (Store after Load)

In a first embodiment, the processor that executes the compiler optimizes a loop illustrated in FIG. 10 in which the array size n and the iteration count m of the loop are different from each other, namely, m<n is satisfied, and the iteration count m of the loop and the vector length VL do not have the specific ratio described above. Furthermore, the first embodiment is an example in which the array size n of a variable, the iteration count m of the loop, and the vector length VL of a loop to be optimized are constants.

FIG. 11 is a diagram illustrating optimization processing (1) of avoiding SFI_2 in the embodiment. This is an example (store after load) in which the innermost loop includes a store instruction after a load instruction. FIG. 11 illustrates the source code C30 of FIG. 10 and the intermediate pseudo code C31 in which a calculation instruction in the innermost loop is vectorized.

FIG. 10 illustrates arrays a(1,1) to a(4,4) on the memory of the vector load instruction vload1 and the vector store instruction vstore1 in the first iteration, and arrays a(1,1) to a(4,4) on the memory of the vector load instructions vload2, vload3, and vload4 in the second to fourth iterations in an assembly code that is converted from the intermediate pseudo code C31. In each loop, the control variable i is incremented from 1 to 32, and an array of j=1-3 is accessed by a vector instruction for the control variable j. That is, the array elements a(1,1)-a(3,1) are accessed in the first iteration, the array elements a(1,2)-a(3,2) are accessed in the second iteration, the array elements a(1,3)-a(3,3) are accessed in the third iteration, and the array elements a(1,4)-a(3,4) are accessed in the fourth iteration by respective vector instructions. Furthermore, the addresses of the array elements a(1,1)-a(3,1) to be accessed by the vector store instruction vstore1 in the first iteration and the addresses of the array elements to be accessed by the vector load instructions vload2, vload3, and vload4 in the second to fourth iterations respectively do not overlap with each other. This means that SFI_1 does not occur between the vector store instruction vstore1 in the first iteration and the vector load instructions vload2, vload3, and vload4 in the subsequent second to fourth iterations. However, the addresses of the vector length 16 elements of the vector store instruction vstore1 in the first iteration and the addresses of the vector length 16 elements of the vector load instructions vload2, vload3, and vload4 in the second to fourth iterations overlap with each other, and thus SFI_2 may occur.
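The distinction between SFI_1 and SFI_2 for this example can be checked with a small model. The following Python sketch is an illustration only and is not part of the compiler. It assumes that the element a(j,i) sits at element offset (i-1)*n+(j-1) (column-major layout), and the helper names masked_elements, full_window, and classify are hypothetical.

```python
def masked_elements(first, m):
    """Element offsets really accessed: the first m lanes have a TRUE mask."""
    return set(range(first, first + m))


def full_window(first, vl):
    """Element offsets covered by all VL lanes, ignoring the mask."""
    return set(range(first, first + vl))


def classify(store_first, load_first, m, vl):
    """Classify one vector store followed by one vector load."""
    if masked_elements(store_first, m) & masked_elements(load_first, m):
        return "SFI_1"  # the masked (real) accesses overlap
    if full_window(store_first, vl) & full_window(load_first, vl):
        return "SFI_2"  # only the VL-wide windows overlap
    return "none"


VL, n, m = 16, 4, 3  # values of the source code C30


def first_offset(i):
    """Assumed offset of a(1,i): columns are n elements apart."""
    return (i - 1) * n


# vstore1 (iteration 1) against vload2 .. vload6 (iterations 2 to 6)
results = {i: classify(first_offset(1), first_offset(i), m, VL)
           for i in range(2, 7)}
for i, kind in results.items():
    print(f"vstore1 vs vload{i}: {kind}")
```

Under these assumptions, vload2 to vload4 are classified as SFI_2 and the loads from the fifth iteration onward no longer touch the VL-wide window of vstore1.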

In view of this, in this embodiment illustrated in FIG. 11, the processor that executes the compiler unrolls the innermost loop (rows 03 to 06) including a vector calculation instruction (row 04) of the intermediate pseudo code C31 after vectorization four times, so as to generate an intermediate pseudo code C32. The loop (rows 03 to 05) of the intermediate pseudo code C31 is changed to an unrolled loop of rows 03 to 08 in the intermediate pseudo code C32. Through four times of unrolling, the loop (rows 03 to 08) of the intermediate pseudo code C32 includes the following four vector calculation instructions. Those four vector calculation instructions are calculation instructions in the first to fourth iterations (i=1-4) of the original code C31.

a(1:3,i)=a(1:3,i)+b(1:3,i)

a(1:3,i+1)=a(1:3,i+1)+b(1:3,i+1)//original second iteration

a(1:3,i+2)=a(1:3,i+2)+b(1:3,i+2)//original third iteration

a(1:3,i+3)=a(1:3,i+3)+b(1:3,i+3)//original fourth iteration

Because of this, in the DO statement of the row 03, the control variable i is incremented by 4 up to 32.

The number of times of unrolling is set to be 4 based on a function of dividing the vector length VL=16 by the array size n=4 of the variable a and rounding up, namely, ceil(VL/n)=ceil(16/4)=4. That is, as illustrated in FIG. 10, the addresses of the vector length elements of the vector store instruction vstore1 in the first iteration and the addresses of elements (respective three elements) to be accessed based on the mask registers of the vector load instructions vload1 to vload4 in the first to fourth iterations overlap with each other, whereas the addresses of the vector length elements of the vector store instruction vstore1 in the first iteration do not overlap with the addresses of elements to be accessed based on the mask register of a vector load instruction in the fifth or subsequent iteration. This is the reason for four times of unrolling.

FIG. 12 is a diagram illustrating loop unrolling and scheduling of optimization processing of avoiding SFI_2 in this embodiment. FIG. 12 illustrates the addresses of the vector length elements of the vector store instruction vstore1 in the first iteration, which is identical to that of FIG. 10, and the addresses of elements (respective three elements) to be accessed based on the mask registers of the vector load instructions vload1 to vload4 in the first to fourth iterations.

As illustrated in FIG. 12, through the four times of unrolling, the vector load instructions vload2 to vload4 in the second to fourth iterations are positioned after (below in the vertical direction) the vector store instruction vstore1 in the first iteration. Meanwhile, the vector load instruction vload1 in the first iteration is positioned before (above in the vertical direction) the vector store instruction vstore1 in the first iteration. Furthermore, the addresses to be accessed by the vector store instruction vstore1 in the first iteration do not overlap with the addresses to be accessed by the vector load instructions vload2 to vload4 in the second to fourth iterations, with the result that SFI_1, which is the original SFI, does not occur.

Therefore, the processor that executes the compiler executes scheduling of moving the vector load instructions vload2 to vload4 in the second to fourth iterations before the vector store instruction vstore1 in the first iteration among the vector calculation instructions unrolled four times. This is the movement of instructions indicated by the broken arrows of scheduling illustrated in FIG. 12. As a result of this scheduling, the vector load instructions vload1 to vload4 in the first to fourth iterations are positioned before the vector store instruction vstore1 in the first iteration, and thus SFI_2 does not occur between these instructions due to store after load, and the processor that executes a compiled object code does not detect SFI. Thus, the processor can execute the vector load instructions vload1 to vload4 in the first to fourth iterations before the vector store instruction vstore1 in the first iteration, and can also execute the four vector load instructions by out-of-order execution, thereby reducing the calculation period by executing the four instructions in parallel, for instance.

FIG. 11 illustrates the processing of scheduling of the intermediate pseudo code C32 subjected to loop unrolling in a pseudo manner. The intermediate pseudo code C33 is generated by extracting a load instruction (temp #=a(1:3,#) where #=1-4) from four vector calculation instructions of the intermediate pseudo code C32. The extracted load instruction is an instruction of loading and inputting a variable a(1:3,#) into a register temp #. As a result, a sum of the register temp # and a variable b(1:3,#) is stored in the variable a(1:3,#) after the extracted load instruction.

Then, the processor that executes the compiler executes scheduling of moving the vector load instructions vload2 to vload4 in the second to fourth iterations of the intermediate pseudo code C33 before the vector store instruction vstore1 in the first iteration, to thereby generate an intermediate pseudo code C34.
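The transformation from the code C31 to the code C34 can be mimicked in ordinary Python to confirm that hoisting the loads does not change the result. This is a minimal sketch under stated assumptions, not the compiler's output: each column of the array is modeled as a Python list, the function names original_loop and unrolled_and_scheduled are hypothetical, and the list temps stands in for the registers temp1 to temp4.

```python
import copy

N, M, COLS = 4, 3, 32  # array size n, iteration count m, columns (C30 values)


def original_loop(a, b):
    """Reference: a(1:3,i) = a(1:3,i) + b(1:3,i) for every column i."""
    for i in range(COLS):
        a[i][:M] = [x + y for x, y in zip(a[i][:M], b[i][:M])]


def unrolled_and_scheduled(a, b, factor=4):
    """Unroll by `factor`, then issue every load before the first store."""
    for i in range(0, COLS, factor):
        # Hoisted loads: temp1..temp4 = a(1:3,i) .. a(1:3,i+3)
        temps = [a[i + k][:M] for k in range(factor)]
        # The stores follow, each using its own previously loaded temp
        for k in range(factor):
            a[i + k][:M] = [t + y for t, y in zip(temps[k], b[i + k][:M])]


# a[i][j] plays the role of a(j+1,i+1); the values are arbitrary test data.
a_ref = [[float(i * N + j) for j in range(N)] for i in range(COLS)]
b = [[1.0 + j for j in range(N)] for _ in range(COLS)]
a_opt = copy.deepcopy(a_ref)

original_loop(a_ref, b)
unrolled_and_scheduled(a_opt, b)
print("results identical:", a_ref == a_opt)
```

The loads can be hoisted safely here because each column is read only by its own iteration, so no hoisted load reads a column that an earlier store in the same group has written.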

The number of times of unrolling is the number of vector load instructions for which the addresses of the vector length elements of a vector store instruction and the addresses of elements to be accessed based on the mask register overlap with each other. That is, the number of times of unrolling is the value of the ceil function ceil(VL/n) for the vector length VL and the array size n of the variable a described above.

FIG. 13 is a diagram illustrating a relationship between the array size n and the value of the ceil function ceil(VL/n) in the case of vector length VL=16. When the vector length VL=16 is satisfied and the array size n of the variable a satisfies n<VL, n=1 to 15 is satisfied. FIG. 13 illustrates the value of the ceil function ceil(VL/n) and the value of ceil(VL/n)−1 for n=1 to 16. The value of the ceil function ceil(VL/n) is 4 in the case of the array size n=4 illustrated in FIG. 11 and FIG. 12.
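The relationship between n and the two unrolling counts can be tabulated directly. The following Python sketch recomputes the values for VL=16. The function name unroll_count and its load_after_store flag are hypothetical names chosen for this illustration.

```python
def unroll_count(vl, n, load_after_store=False):
    """ceil(VL/n), or ceil(VL/n)-1 when the loop loads after it stores."""
    count = (vl + n - 1) // n  # integer form of ceil(vl / n)
    return count - 1 if load_after_store else count


VL = 16
table = [(n, unroll_count(VL, n), unroll_count(VL, n, True))
         for n in range(1, VL + 1)]
for n, store_after_load, load_after_store in table:
    print(f"n={n:2d}  ceil(VL/n)={store_after_load:2d}  "
          f"ceil(VL/n)-1={load_after_store:2d}")
```

For n=4 this yields 4 for the store-after-load loop and 3 for the load-after-store loop.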

Example of Loading after Storing when Loop Includes Load Instruction after Store Instruction

FIG. 14 is a diagram illustrating optimization processing (2) of avoiding SFI_2 in the embodiment. FIG. 14 is an example of loading after storing when the innermost loop includes a load instruction after a store instruction. In the source code C30_2, the innermost loop (rows 04 to 07) includes a store instruction (row 05) for storing the array variable b into the array variable a in the memory and a load instruction (row 06) for loading the variable a from the array variable a in the memory. In the intermediate pseudo code C31_2, a plurality of instructions to be repeated in the innermost loop in C30_2 are converted into vector instructions (rows 04 and 05).

FIG. 15 is a diagram illustrating loop unrolling and scheduling of optimization processing of avoiding SFI_2 in this embodiment. FIG. 15 illustrates the vector length VL array elements a(1,1) to a(4,4) on the memory of the vector store instruction vstore1 in the first iteration, and array elements a(1,2) to a(4,2), a(1,3) to a(4,3), and a(1,4) to a(4,4) on the memory of the vector load instructions vload1, vload2, and vload3 in the first to third iterations in a case where the intermediate pseudo code C31_2 is converted into an assembly code. In each loop, the array elements a(j,i) with the control variable i being incremented from 1 to 32 and the control variable j=1-3 are accessed by a vector instruction. In the vector load instruction, the array elements a(1,2)-a(3,2) are accessed in the first iteration, the array elements a(1,3)-a(3,3) are accessed in the second iteration, the array elements a(1,4)-a(3,4) are accessed in the third iteration, and these array elements are accessed by respective vector instructions. Furthermore, the addresses of the array elements a(1,1)-a(3,1) to be accessed by the vector store instruction vstore1 in the first iteration based on the mask register and the addresses of the array elements to be accessed by the vector load instructions vload1, vload2, and vload3 in the first to third iterations respectively based on the mask register do not overlap with each other. This means that SFI_1 does not occur between the vector store instruction vstore1 in the first iteration and the vector load instructions vload1, vload2, and vload3 in the subsequent first to third iterations. However, SFI_2 may occur.

Referring back to FIG. 14, in this embodiment, the processor that executes the compiler unrolls the vector calculation instructions (rows 04 and 05) of the vectorized intermediate pseudo code C31_2 three times to generate an intermediate pseudo code C32_2. The loop (rows 03 to 06) of the intermediate pseudo code C31_2 is changed to a loop of rows 03 to 10 in the intermediate pseudo code C32_2. Through three times of unrolling, the loop (rows 03 to 10) of the intermediate pseudo code C32_2 includes the following three sets of vector calculation instructions. Those three sets of vector calculation instructions are calculation instructions in the first to third iterations (i=1-3) of the original code C31_2.

a(1:3,i)=b(1:3,i)

c(1:3,i)=a(1:3,i+1)+i

a(1:3,i+1)=b(1:3,i+1)//original second iteration

c(1:3,i+1)=a(1:3,i+2)+i+1//original second iteration

a(1:3,i+2)=b(1:3,i+2)//original third iteration

c(1:3,i+2)=a(1:3,i+3)+i+2//original third iteration

Because of this, in the DO statement of the row 03, the control variable i is incremented by 3.

The number of times of unrolling is set to be 3 based on the ceil function, namely, ceil(VL/n)−1=ceil(16/4)−1=3. That is, the addresses of the vector length elements of the vector store instruction vstore1 in the first iteration and the addresses of elements to be accessed based on the mask registers of the vector load instructions vload1 to vload3 in the first to third iterations overlap with each other, whereas the addresses of the vector length elements of the vector store instruction vstore1 in the first iteration do not overlap with the addresses of elements to be accessed based on the mask register of a vector load instruction in the fourth or subsequent iteration. This is the reason for three times of unrolling.

FIG. 15 illustrates the addresses of the vector length elements of the vector store instruction vstore1 in the first iteration, which is identical to that of FIG. 12, and the addresses of elements to be accessed based on the mask registers of the vector load instructions vload1 to vload3 in the first to third iterations. As illustrated in FIG. 15, through the three times of unrolling, the vector load instructions vload1 to vload3 in the first to third iterations are positioned after the vector store instruction vstore1 in the first iteration. Furthermore, the addresses to be accessed by the vector store instruction vstore1 in the first iteration based on the mask register do not overlap with the addresses to be accessed by the vector load instructions vload1 to vload3 in the first to third iterations based on the mask registers, with the result that SFI_1, which is the original SFI, does not occur.

In view of this, in this embodiment, the processor that executes the compiler executes scheduling of moving the vector load instructions vload1 to vload3 in the first to third iterations before the vector store instruction vstore1 in the first iteration among the vector calculation instructions unrolled three times. This scheduling is the movement of instructions indicated by the broken arrows illustrated in FIG. 15. As a result of this scheduling, the vector load instructions vload1 to vload3 in the first to third iterations are positioned before the vector store instruction vstore1 in the first iteration, and thus the processor that executes a compiled object code does not detect SFI_2 between the three vloads and the vstore. Thus, the processor can execute the vector load instructions vload1 to vload3 in the first to third iterations before the vector store instruction vstore1 in the first iteration, and can also execute the three vector load instructions by out-of-order execution, thereby reducing the calculation period by parallel processing, for instance.

FIG. 14 illustrates the processing of scheduling of the intermediate pseudo code C32_2 subjected to loop unrolling in a pseudo manner. The intermediate pseudo code C33_2 and an intermediate pseudo code C34_2 after scheduling correspond to the codes C33 and C34 of FIG. 11, respectively. Thus, description thereof is omitted here.
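Since the description of the scheduled code is omitted, the load-after-store case can be checked with a small Python model. This is a sketch under stated assumptions and not the code C34_2 itself: columns are modeled as Python lists, the array a is given one extra column so that a(1:3,i+1) exists for i=32, and the remainder loop for the iterations left over when 32 is not a multiple of 3 is an assumption of this illustration. The function names are hypothetical.

```python
import copy

N, M, COLS = 4, 3, 32  # array size n, iteration count m, loop trips


def one_iteration(a, b, c, i):
    """a(1:3,i) = b(1:3,i); c(1:3,i) = a(1:3,i+1) + i (i is 0-based here)."""
    a[i][:M] = b[i][:M]                              # vector store
    c[i][:M] = [x + (i + 1) for x in a[i + 1][:M]]   # vector load afterwards


def original_loop(a, b, c):
    for i in range(COLS):
        one_iteration(a, b, c, i)


def unrolled_and_scheduled(a, b, c, factor=3):
    """Unroll by `factor` and hoist the loads above the first store."""
    main = COLS - COLS % factor
    for i in range(0, main, factor):
        # Hoisted loads of a(1:3,i+1) .. a(1:3,i+3)
        temps = [a[i + k + 1][:M] for k in range(factor)]
        for k in range(factor):
            a[i + k][:M] = b[i + k][:M]
            c[i + k][:M] = [x + (i + k + 1) for x in temps[k]]
    for i in range(main, COLS):  # assumed remainder loop, not unrolled
        one_iteration(a, b, c, i)


# a has COLS + 1 columns so that column i+1 always exists.
a_ref = [[float(10 * i + j) for j in range(N)] for i in range(COLS + 1)]
b = [[-1.0 * (i + j) for j in range(N)] for i in range(COLS)]
c_ref = [[0.0] * N for _ in range(COLS)]
a_opt, c_opt = copy.deepcopy(a_ref), copy.deepcopy(c_ref)

original_loop(a_ref, b, c_ref)
unrolled_and_scheduled(a_opt, b, c_opt)
print("results identical:", a_ref == a_opt and c_ref == c_opt)
```

In the original order, each load of column i+1 already precedes the store to that column in the next iteration, so moving the three loads ahead of the first store still reads the same values.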

In the example of loading after storing in the innermost loop, the number of times of unrolling is the number of vector load instructions for which the addresses of the vector length elements of a vector store instruction and the addresses of elements to be accessed based on the mask register overlap with each other. The positions of array elements of the vector store instruction vstore1 and the vector load instruction vload1 deviate from each other in the first iteration, and thus the number of times of unrolling is the value of ceil(VL/n)−1 obtained by subtracting one from the ceil function for the vector length VL and the array size n of the variable a. This is a value lower than the examples of FIG. 11 and FIG. 12 by one.

Summary of Example of Storing after Loading in Loop and Example of Loading after Storing in Loop

Example of Storing after Loading in Loop

FIG. 16 is a diagram illustrating loop unrolling and scheduling in the case of a source code in which the loop includes a load instruction before a store instruction. In the same manner as in FIG. 11, the preceding load instruction and the subsequent store instruction are instructions that access the same element of the same variable. FIG. 16 illustrates a relationship between memory addresses (horizontal direction) of the vector store instruction vstore and the vector load instruction vload of each loop after loop unrolling and scheduling. In FIG. 16, the vector length VL indicates the position of a memory address of the vector length array elements to be accessed by the store instruction vstore1 in the first iteration.

This embodiment discusses a loop structure that enables avoidance of SFI_2, namely, a loop structure in which the iteration count m of the loop is lower than the array size n (m<n) and the array size n is equal to or lower than the vector length VL (n<=VL). Thus, in the mask registers of the vector load instruction vload and the vector store instruction vstore, the first m elements of the array are TRUE (T), the next n-m elements are FALSE (F), and the last VL-n elements are also FALSE (F). Furthermore, the addresses of the first elements of both the instructions deviate from each other by the array size n in the first iteration and the second iteration. The same applies to the third and subsequent iterations. Therefore, in the same manner as in FIG. 11, when VL=16 and n=4 are set, SFI_2 may occur between the vector store instruction vstore1 in the first iteration and the vector load instructions vload2 to vload4 in the second to fourth iterations.
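The mask layout and the address deviation described above can be written out as a short Python sketch. It is an illustration only: the function names mask_layout and first_element are hypothetical, and offsets are counted in array elements rather than bytes.

```python
def mask_layout(m, n, vl):
    """Mask of one vector access: m TRUE lanes, then n-m and VL-n FALSE lanes."""
    if not (0 < m < n <= vl):
        raise ValueError("this sketch assumes 0 < m < n <= VL")
    return [True] * m + [False] * (n - m) + [False] * (vl - n)


def first_element(iteration, n):
    """Element offset of the first lane; iterations are n elements apart."""
    return (iteration - 1) * n


mask = mask_layout(m=3, n=4, vl=16)
print("".join("T" if lane else "F" for lane in mask))
print([first_element(k, 4) for k in range(1, 6)])
```

With VL=16 and n=4, the first lanes of the second to fourth iterations still fall inside the sixteen-element window of the first iteration, which is where SFI_2 can arise.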

In this example, the processor that executes the compiler executes four times of loop unrolling for the loop to be optimized, to expand (unroll) the vector load instructions vload and the vector store instructions vstore in the first to fourth iterations as illustrated in FIG. 16. Then, the processor that executes the compiler executes scheduling of moving the vector load instructions vload2 to vload4 in the second to fourth iterations, which are located after the first vector store instruction vstore1, before the first vector store instruction vstore1. As can be understood from the state after scheduling, the four vector load instructions vload1 to vload4 are positioned before the four vector store instructions vstore1 to vstore4, and thus the processor that executes the compiled object code does not detect SFI_2. Then, the processor can execute the four vector load instructions vload1 to vload4 by out-of-order execution.

Example of Loading after Storing in Loop

FIG. 17 is a diagram illustrating loop unrolling and scheduling in the case of a source code in which the loop includes a store instruction before a load instruction. In the same manner as in FIG. 14, the preceding store instruction and the subsequent load instruction in the same loop are instructions that access different elements of the same variable. Furthermore, in the first iteration, the addresses of the first elements of a store instruction and a load instruction deviate from each other by the array size n. The same applies to the second and subsequent iterations. Therefore, when VL=16 and n=4, SFI_2 may occur between the vector store instruction vstore1 in the first iteration and the vector load instructions vload1 to vload3 in the first to third iterations.

In this example, the processor that executes the compiler executes three times of loop unrolling for the loop to be optimized, to expand (unroll) the vector load instructions vload and the vector store instructions vstore until the third iteration as illustrated in FIG. 17. Then, the processor executes scheduling of moving the vector load instructions vload1 to vload3 in the first to third iterations, which are located after the first vector store instruction vstore1, before the first vector store instruction vstore1. Under the state after scheduling, the three vector load instructions vload1 to vload3 are positioned before the three vector store instructions vstore1 to vstore3, and thus the processor that executes the compiled object code does not detect SFI_2. Therefore, the processor can execute the three vector load instructions vload1 to vload3 by out-of-order execution.

Second Embodiment: Example in which Array Size n of Variable and Iteration Count m of Loop Are Variables, and Vector Length VL Is Constant or Unknown

Also in a second embodiment, the processor that executes the compiler optimizes a loop in which the array size n illustrated in FIG. 10 and the iteration count m of the loop are different from each other, namely, m<n is satisfied, and the iteration count m of the loop and the vector length VL do not have a ratio that enables application of unpacking described above. Furthermore, the second embodiment is an example in which the array size n of a variable and the iteration count m of the loop to be optimized are variables, and the vector length VL is a constant or unknown. The case of n>VL is excluded from the loop to be optimized. The array size n and the iteration count m of the loop being variables means that the values of both variables are established when the processor executes the compiled object code, and are not established at the time of compiling. Furthermore, the vector length VL being unknown means that the vector length of the processor that executes the object code obtained by compiling the source program is unknown at the time of compiling.

Example in which Loop Includes Store Command after Load Command

FIG. 18 is a diagram illustrating an example of a source code including the loop to be optimized and an intermediate pseudo code after vectorization in the second embodiment. This is an example in which the innermost loop (rows 04 to 06) includes a store instruction after a load instruction in the source code C40. In the source code C40, the array size n of the array variables a(n,32), b(n,32), c(n,32) is a variable, the iteration count m of the innermost loop is a variable, and the vector length VL is a constant or unknown. In this case, m=<n is satisfied logically, and m=<VL is assured when the loop of n>VL is excluded from the loop to be optimized. For example, FIG. 10 shows a source code C30 with m=3, n=4, VL=16, m=<n<16.

When the innermost loop of the source code C40 is vectorized by the iteration count m, an intermediate pseudo code C41 after vectorization is generated. In the intermediate pseudo code C41, the innermost loop (rows 04 to 06) of C40 is converted into the following calculation instructions.

do j = 1, m, VL //VL: 16

a(j:j+VL−1,i) = a(j:j+VL−1,i) + b(j:j+VL−1,i)

enddo

In the case of the intermediate pseudo code C41 after vectorization, SFI_2 may occur depending on the relationship between the array size n and the vector length VL. As illustrated in FIG. 10, SFI_2 in this case corresponds to a case in which the addresses of the vector length elements of both instructions overlap with each other between the vector load instruction and the subsequent vector store instruction, but the addresses of elements to be accessed by both the instructions based on the mask registers do not overlap with each other.

In this embodiment, the compiler executes optimization processing of avoiding SFI_2 for such a loop. The outline of the optimization processing of avoiding SFI_2 is as follows.

(1) When the declared array size n of the array variable is a variable in the source code, the array size n is unknown at the time of compiling. Thus, the loop (rows 03 to 07) of the vectorized code C41 is converted into case-specific loops, each including a vector load instruction and a vector store instruction in each innermost loop (rows 03 to 07), separately for each value (separately for each case) of a function ceil(VL/n) (or ceil(VL/n)−1 when there is a load instruction after a store instruction in the loop) of dividing the vector length VL by the array size n and rounding up.

FIG. 19 is a diagram illustrating an intermediate pseudo code C42 having a branch structure in which the above-mentioned loop (rows 03 to 07) is converted into case-specific loops. The code C42 is an example of setting the vector length VL to VL=16. In the code C42, the rows 02 to 34 are case-specific loops in a case where the vector length VL is included in the array VL_array (array storing vector length VL that causes SFI_2) described later and the condition of the if statement having the condition of n<=VL is true. The rows 35 to 42 are a loop in a case where the condition of the if statement is false.

In the case-specific loops of the rows 02 to 34, a vector calculation instruction (row 04) of the innermost loop (rows 04 to 06) of the code C41 is described in each of cases 0 to 16 (case 0 to case 16) corresponding to a case in which the condition ceil(VL/n) of the select statement of the row 03 is 0 to 16. When the condition n<=VL of the if statement is true, m<=VL is satisfied, and thus a vector calculation instruction of the innermost loop (rows 04 to 06) of the code C41 is set to be a code of rows 06 to 08, for instance, in the code C42 of FIG. 19. The following code is given.

06 do i = 1, 32

07 a(1:m,i) = a(1:m,i) + b(1:m,i)

08 enddo

As illustrated in ceil(VL/n) of FIG. 13, when the array size n is n=8 to 15, ceil(VL/n)=2 is satisfied, which corresponds to the case 2 in the code C42 of FIG. 19.

(2) Next, loop unrolling (expanding) is performed by the number ceil(VL/n) of times of unrolling for each case-specific loop of the code C42 of FIG. 19, and scheduling of moving a vector load instruction subsequent to the first vector store instruction before the first vector store instruction is performed among the unrolled instructions.

FIG. 20 is a diagram illustrating an intermediate pseudo code C43 obtained by loop unrolling each case-specific loop of the intermediate pseudo code C42, which has a branch structure converted into the case-specific loops of FIG. 19, by the corresponding number of times of unrolling. In FIG. 20, scheduling is not executed yet. In the code C43, a vector calculation instruction is unrolled in the case-specific loop of cases 2 to 8 of rows 09 to 37. In the cases 2 and 3 of rows 09 to 19, loop unrolling is performed by the corresponding number (ceil(VL/n)=2 and 3) of times of cases. In contrast, in the cases 4, 6, and 8 of rows 20 to 37, the array size n is equal to or lower than the number ceil(VL/n)=4, 6, 8 of cases, and thus the number of times of unrolling is restricted to the array size n. The case 16 of rows 38 to 43 is an example of ceil(VL/n)=16, and the array size n=1 is satisfied. Thus, the iteration count m satisfies m=n=1, and vectorization is not performed, which returns to the loop structure of a scalar instruction of the original code C40. Rows 44 to 50 satisfy the condition of n>VL, and SFI_2 does not occur, which keeps the loop structure of the code C41.
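Steps (1) and (2) can be illustrated by a runtime dispatch on ceil(VL/n). The following Python sketch is an assumption-laden simplification rather than the codes C42 or C43: a single loop with a runtime unrolling factor stands in for the separately generated case-specific loops, the restriction of the unrolling count to n in the cases 4, 6, and 8 and the scalar fallback of the case 16 are not modeled, and the remainder loop and the function names are hypothetical.

```python
import copy


def ceil_div(x, y):
    """Integer ceil(x / y) for positive integers."""
    return (x + y - 1) // y


def add_columns(a, b, m, n, vl):
    """Store-after-load loop a(1:m,i) += b(1:m,i) with case selection.

    Returns the unrolling factor chosen at run time.
    """
    cols = len(a)
    # Branch of the if statement: for n > VL the plain loop is kept.
    factor = ceil_div(vl, n) if n <= vl else 1
    main = cols - cols % factor
    for i in range(0, main, factor):
        # Scheduling: all loads of the unrolled group come first
        temps = [a[i + k][:m] for k in range(factor)]
        for k in range(factor):
            a[i + k][:m] = [t + y for t, y in zip(temps[k], b[i + k][:m])]
    for i in range(main, cols):  # assumed remainder loop
        a[i][:m] = [x + y for x, y in zip(a[i][:m], b[i][:m])]
    return factor


def reference(a, b, m):
    """Plain loop used only to check the result."""
    for i in range(len(a)):
        a[i][:m] = [x + y for x, y in zip(a[i][:m], b[i][:m])]


n, m, vl, cols = 6, 5, 16, 32  # runtime values; ceil(16/6) = 3
a_ref = [[float(i + j) for j in range(n)] for i in range(cols)]
b = [[2.0 * j for j in range(n)] for _ in range(cols)]
a_opt = copy.deepcopy(a_ref)

reference(a_ref, b, m)
chosen = add_columns(a_opt, b, m, n, vl)
print("factor:", chosen, "identical:", a_ref == a_opt)
```

Computing the factor at run time keeps the sketch short, whereas the compiler described here emits one specialized loop per case so that each loop body is fixed at compile time.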

Example in which Loop Includes Load Command after Store Command

FIG. 21 is a diagram illustrating an example of a source code including the loop to be optimized and an intermediate pseudo code after vectorization in the second embodiment. This is an example in which the innermost loop (rows 04 to 07) includes a load instruction after a store instruction. In the source code C50, the array size n of the array variables a(n,33), b(n,33), c(n,33) is a variable, the iteration count m of the innermost loop is a variable, and the vector length VL is a constant or unknown. In this case, m=<n is satisfied logically, and m=<VL is assured when the loop of n>VL is excluded from the loop to be optimized.

When the innermost loop of the source code C50 is vectorized by the iteration count m, an intermediate pseudo code C51 after vectorization is generated. In the intermediate pseudo code C51, the innermost loop (rows 04 to 07) of C50 is converted into the following calculation instructions.

do j = 1, m, VL //VL: 16

a(j:j+VL−1,i) = b(j:j+VL−1,i)

c(j:j+VL−1,i) = a(j:j+VL−1,i+1) + i

enddo

Also when the loop includes a load instruction after a store instruction, as in the case (FIGS. 18, 19, and 20) of the loop including a store instruction after a load instruction, the optimization processing of avoiding SFI_2 by the compiler involves (1) converting the vectorized intermediate pseudo code C51 into case-specific loops, and (2) executing loop unrolling and scheduling in each loop. Because the loop includes a load instruction after a store instruction, the number of times of unrolling is ceil(VL/n)−1, which is different from the case of the loop including a store instruction after a load instruction.

FIG. 22 is a diagram illustrating an intermediate pseudo code C52 having branch structure in which the innermost loop (rows 03 to 07) of the vectorized intermediate pseudo code C51 is converted into case-specific loops. This case is equivalent to FIG. 19 except for the number of times of unrolling, and thus description of FIG. 22 is omitted.

FIG. 23 is a diagram illustrating an intermediate pseudo code C53 obtained by unrolling each case-specific loop of the intermediate pseudo code C52, which has branch structure in which the vectorized innermost loop is converted into the case-specific loops of FIG. 22, by the corresponding number of times of unrolling. FIG. 23 is equivalent to FIG. 20 except that the number of times of unrolling is ceil(VL/n)−1, and thus description of FIG. 23 is omitted.

Optimization Processing of Compiler for First and Second Embodiments

Next, description is given of the optimization processing of the compiler with reference to a flow chart. The optimization processing of the compiler described below is executed for both the loop to be optimized in the first embodiment and the loop to be optimized in the second embodiment. The loop to be optimized in the first embodiment is a loop in which the vector length VL and the array size n of the variable are constants, n<VL is true, and a program in the loop may cause SFI_2. The loop to be optimized in the second embodiment is a loop in which the vector length VL is a constant or unknown, the array size n of the variable is a variable, whether n<VL holds is unknown, and a program in the loop may cause SFI_2.

FIG. 24 is a diagram illustrating a flow chart of overall optimization processing of the compiler. The processor that executes the compiler optimizes predetermined loop structure in an intermediate code generated from a source program so that SFI_2 can be avoided by conversion into a single loop (deletion of innermost loop) or an unpacked store instruction and load instruction (S0).

Then, the processor that executes the compiler analyzes loop structure for a loop that cannot be optimized in the processing S0, and generates a loop structure (a data structure) that stores loop structure information for each loop (S1). The loop to be analyzed is a loop including a vectorized store instruction and load instruction.

FIG. 25 is a diagram illustrating an example of the loop structure. The loop structure LOOP_STRUCT is data structure that stores a parameter including, for instance, loop features used for optimizing the loop. The loop structure includes parameters, e.g., whether the loop is a loop to be optimized (TRUE/FALSE), whether the loop type Loop_type is 1 or 2, the number of times of unrolling, whether the vector length VL is unknown, whether SFI_2 is to occur (TRUE/FALSE), whether the type SFI_2_type of SFI_2 is 0 or 1, and the array (VL_array) with the vector length VL that causes SFI_2.

The loop type 1 indicates a loop in which VL and n are constants and n<VL is true, which is the loop in the first embodiment. The loop type 2 indicates a loop in which VL is a constant or unknown, n is a variable, and n<VL is unknown, which is the loop in the second embodiment. The number of times of unrolling is ceil(VL/n)−SFI_2_type (0 or 1) in the case of the loop type 1, and NULL in the case of the loop type 2. The type SFI_2_type of SFI_2 is SFI_2_type=0 when the loop includes a store instruction after a load instruction, and is SFI_2_type=1 when the loop includes a load instruction after a store instruction. In the loop type 2, the vector length VL is unknown, and thus the processor that executes the compiler checks the vector length that causes SFI_2, and stores the value of the vector length VL for which occurrence of SFI_2 is detected into the vector length array VL_array that causes SFI_2. In FIG. 25, VL=2, 4, 8, 16 are stored in Loop_type=2.
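As a rough illustration of the parameters listed for LOOP_STRUCT, the record could be modeled as below. The class name LoopStruct and all field names are assumptions made for this sketch; only the set of parameters comes from the description of FIG. 25.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LoopStruct:
    """Hypothetical model of LOOP_STRUCT (field names are assumptions)."""
    optimize: bool = False               # loop to be optimized (TRUE/FALSE)
    loop_type: Optional[int] = None      # Loop_type: 1 or 2
    unroll_count: Optional[int] = None   # NULL (None) for loop type 2
    vl_unknown: bool = False             # whether the vector length VL is unknown
    sfi_2: bool = False                  # whether SFI_2 is to occur
    sfi_2_type: Optional[int] = None     # 0: store after load, 1: load after store
    vl_array: List[int] = field(default_factory=list)  # VLs that cause SFI_2


# A loop of type 2, filled in the way the Loop_type=2 example of FIG. 25 is described.
example = LoopStruct(optimize=True, loop_type=2, vl_unknown=True,
                     sfi_2=True, sfi_2_type=0, vl_array=[2, 4, 8, 16])
```

The default_factory keeps each record's VL_array independent, so registering a vector length for one loop does not affect another.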

Referring back to FIG. 24, the processor that executes the compiler analyzes the loop structure of each loop, and generates the above-mentioned loop structure (S1). Then, the processor that executes the compiler determines whether each loop of the intermediate code is a loop to be optimized by referring to the loop structure (S2). The processor executes the processing S3, S3-1, S3-2 a, and S3-2 b for all the loops to be optimized (YES in S2), and finishes the processing when there is no loop to be optimized (NO in S2).

The processor that executes the compiler determines whether the loop to be optimized is the loop type 1 or the loop type 2 based on the loop structure (S3). When the loop type Loop_type=1 is satisfied, the processor unrolls the instructions in the loop [Ceil(VL/n)−SFI_2_type] times (or only n times when [Ceil(VL/n)−SFI_2_type]>array size n), and, among the unrolled instructions, moves (schedules) the vector load instruction (or instructions) subsequent to the vector store instruction in the first iteration before that vector store instruction (S3-1). The moved vector load instruction (or instructions) is (are) a load instruction (or instructions) in the [2−SFI_2_type]-th and subsequent iterations. The above-mentioned processing S3-1 in the case of the loop type 1 has been described with reference to FIG. 11 and FIG. 12, FIG. 16, and FIG. 18 to FIG. 20.

Meanwhile, in the case of the loop type 2, the processor that executes the compiler converts the loop into a program (branch program) having case-specific loops each of which branches depending on Ceil(VL/n)−SFI_2_type (S3-2 a). Furthermore, the processor unrolls the instructions of each case-specific loop of the branch program [Ceil(VL/n)−SFI_2_type] times (or only n times when [Ceil(VL/n)−SFI_2_type]>array size n). Furthermore, among the unrolled instructions, the processor moves the vector load instruction (or instructions) subsequent to the vector store instruction in the first iteration before that vector store instruction (S3-2 b). The moved vector load instruction (or instructions) is (or are) a load instruction (or instructions) in the [2−SFI_2_type]-th and subsequent iterations. The above-mentioned processing S3-2 a and S3-2 b in the case of the loop type 2 has been described with reference to FIG. 14 and FIG. 15, FIG. 17, and FIG. 21 to FIG. 23.
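The unrolling and scheduling steps can be illustrated on a symbolic instruction list. The sketch below is an assumption-laden toy: instructions are (name, iteration) pairs, the loop body contains only a vector load and a vector store, and the helper names unroll and hoist_loads are hypothetical. It does not model calculation instructions or dependence checks, which a real scheduler would need.

```python
def unroll(body, times):
    """Repeat the loop body `times` times, tagging each instruction with its iteration (1-based)."""
    return [(op, it) for it in range(1, times + 1) for op in body]


def hoist_loads(instrs):
    """Move every vector load located after the first vector store to just before that store."""
    first_store = next((i for i, (op, _) in enumerate(instrs) if op == "vstore"), None)
    if first_store is None:
        return list(instrs)          # no store: nothing to schedule
    head, tail = instrs[:first_store], instrs[first_store:]
    loads = [ins for ins in tail if ins[0] == "vload"]
    rest = [ins for ins in tail if ins[0] != "vload"]
    return head + loads + rest       # relative order of loads and of stores is preserved


# SFI_2_type=0 (store after load), unrolled twice: the load of iteration 2 is hoisted.
print(hoist_loads(unroll(["vload", "vstore"], 2)))
# SFI_2_type=1 (load after store), unrolled once: the load of iteration 1 is hoisted.
print(hoist_loads(unroll(["vstore", "vload"], 1)))
```

In both cases the first hoisted load belongs to iteration 2−SFI_2_type, which is the rule stated for the moved load instructions.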

FIG. 26 is a diagram illustrating a flow chart of the processing of loop structure analysis S1. The processor that executes the compiler determines whether the following conditions are all TRUE (S11). Specifically, the conditions are whether the loop of the intermediate code is a doubly or more deeply nested loop (multi-loop), whether the innermost loop includes one control variable (j), whether there is one set of the vector store instruction and the vector load instruction with the same array variable (a) in the innermost loop, and whether there is no variable having the same variable type as that of the target array variable (a) and having a different array size (number of elements n). The processor executes the following processing when the result of determination is YES, or finishes the processing of S1 when the result of determination is NO.

The processor that executes the compiler acquires the array size (n) and the size of the array variable updated in the innermost loop (S12). Then, the processor divides the vector register size (byte number) by the type size (byte number) of the variable to calculate the vector length VL (S13). When the array size n is a constant, the vector length VL is determined, whereas when the array size n is a variable, the vector length VL is unknown.
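The division in S13 amounts to the following arithmetic. The function name and the 64-byte register size used in the example are assumptions chosen so that a 4-byte element type yields the VL of 16 that appears in the pseudo code; the source does not state the register size.

```python
def vector_length(register_bytes: int, type_bytes: int) -> int:
    """S13: vector length VL = vector register size / type size (both in bytes)."""
    return register_bytes // type_bytes


# Assumed 64-byte vector register: 4-byte elements give VL=16, 8-byte elements give VL=8.
print(vector_length(64, 4), vector_length(64, 8))
```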

Then, the processor determines whether n<VL is true, and calculates ceil(VL/n) (S14). Then, the processor executes the following determination of loop types.

S15: Determine the loop type as the type 1 (TYPE_1) when the vector length VL is a constant, the array size n is a constant, and n<VL is true.

S16: Determine the loop type as the type 2 (TYPE_2) when the vector length VL is a constant, the array size n is a variable, and n<VL is unknown.

S17: Determine the loop type as the type 2 (TYPE_2) when the vector length VL is unknown, the array size n is a variable, and n<VL is unknown.

Finish the processing when all the determinations of the above-mentioned processes S15 to S17 result in NO.
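The determinations S15 to S17 can be summarized in a small function. This is a sketch under the assumption that a value not fixed at compile time (a variable n or an unknown VL) is represented as None; the function name classify_loop is hypothetical.

```python
from typing import Optional


def classify_loop(vl: Optional[int], n: Optional[int]) -> Optional[int]:
    """Return the loop type (1 or 2), or None when S15 to S17 all result in NO.

    vl: vector length VL, or None when unknown at compile time.
    n:  array size n, or None when it is a variable.
    """
    if vl is not None and n is not None and n < vl:
        return 1      # S15: VL constant, n constant, n<VL true
    if n is None:
        return 2      # S16 (VL constant) or S17 (VL unknown): n variable, n<VL unknown
    return None       # not a loop to be optimized


print(classify_loop(16, 6), classify_loop(16, None), classify_loop(None, None))
```

A constant n with n>=VL, or a constant n with an unknown VL, matches none of S15 to S17 and is therefore left unclassified.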

When the loop type is the type 1 or the type 2, the processor executes determination processing of SFI_2 of FIG. 27 and subsequent FIGs.

FIG. 27 and FIG. 28 are diagrams illustrating flow charts of the determination processing S20 of SFI_2. The processor that executes the compiler determines whether the loop has the loop type 1 or the loop type 2 (S21), and when the loop has the loop type 1, the processor executes the following processing S40, S22, S23, and S23(2).

Loop Type 1

The processor that executes the compiler analyzes a relationship between a consecutive vector store instruction and vector load instruction (S40). In this analysis, the processor determines whether the loop causes SFI_2 (SFI_2=TRUE or FALSE), and whether the load instruction is before the store instruction (SFI_2_type=0) or after the store instruction (SFI_2_type=1). Details thereof are described with reference to FIG. 29 and FIG. 30. When the loop causes SFI_2 (SFI_2=TRUE) (YES in S22), the processor records SFI_2=TRUE and SFI_2_type=0 or 1 into the loop structure, and sets VL_array to NULL (S23). The reason for setting VL_array to NULL is that the vector length VL is a constant, and is not to be checked. When the loop does not cause SFI_2 (SFI_2=FALSE) (NO in S22), the processor records SFI_2=FALSE into the loop structure (S23(2)).

Loop Type 2

In the case of the loop type 2, the processor that executes the compiler determines whether the vector length VL is unknown (S24). When the vector length VL is not unknown, that is, a constant (FALSE in S24), the processor executes the following processing while incrementing the array size n, which is a variable, from 1 to the vector length VL, that is, for each value of the array size n from 1 to the vector length VL (S26). When the array size n=VL is reached and ceil(VL/n)=1 is true (TRUE in S27), the processor finishes the processing, whereas when ceil(VL/n)=1 is false (FALSE in S27), the processor analyzes a relationship between a consecutive vector store instruction and vector load instruction (S40).

When the loop causes SFI_2 (SFI_2=TRUE) (TRUE in S22), the processor records SFI_2=TRUE and SFI_2_type=0 or 1 into the loop structure, and sets VL_array to NULL (S23). Then, the processor finishes the determination processing S20 of SFI_2. When the loop does not cause SFI_2 (SFI_2=FALSE) (FALSE in S22), the processor returns to S26 and increments the array size n when not all the array sizes n have been checked (NO in S33). When all the array sizes n have been checked (YES in S33), the processor records the fact that the loop does not cause SFI_2 (SFI_2=FALSE) (S23(2)), and finishes the processing. In other words, when the processor has detected SFI_2=TRUE at least once while incrementing the array size n from 1 to VL, the processor determines that the loop may cause SFI_2.

Referring to FIG. 28, when the loop has the loop type 2 and the vector length VL is unknown (TRUE in S24), the processor executes the processing S26 to S30, and S33 while changing the vector length to 2, 4, 8, . . . , that is, for each vector length (power of 2) to be taken (S25). Among the processing S26 to S30, the processing S26, S27, S40, S28, S33 excluding S29 and S30 are the same as those of FIG. 27. In the processing S29, the processor stores SFI_2=TRUE and SFI_2_type=0 or 1 into the loop structure, and stores the current VL into the vector length array VL_array. The processor repeats the above-mentioned processing until determination of all the vector lengths VL is finished (S30). In other words, the processor repeats the processing S26, S27, S40, S28, S29, S33 until SFI_2=TRUE is determined while changing the array size n for each of all the vector lengths VL. In this manner, the processor detects all the vector lengths VL that cause SFI_2, and registers all the detected VLs in the vector length array VL_array in the loop structure.

When determination of all the vector lengths VL is finished, and the vector length array VL_array of the loop structure is empty (YES in S31), the processor records SFI_2=FALSE and the loop type=NULL into the loop structure (S32). In this case, SFI_2 does not occur for any of the VLs, and thus the analyzed loop is excluded from the loop to be optimized.
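The scan over vector lengths for the loop type 2 with an unknown VL (S25 to S33) can be sketched as two nested loops. This is only an illustration: the function name, the candidate list (2, 4, 8, 16), and the callable causes_sfi_2 standing in for the determination S40 are all assumptions, and the predicate used in the example is a toy, not the real analysis.

```python
from typing import Callable, Iterable, List


def scan_vector_lengths(causes_sfi_2: Callable[[int, int], bool],
                        candidate_vls: Iterable[int] = (2, 4, 8, 16)) -> List[int]:
    """Return the vector lengths for which SFI_2 is detected for some array size n."""
    vl_array = []
    for vl in candidate_vls:               # S25: each vector length (power of 2)
        for n in range(1, vl + 1):         # S26: array size n from 1 to VL
            if -(-vl // n) == 1:           # S27: ceil(VL/n)=1, reached at n=VL
                break
            if causes_sfi_2(vl, n):        # S40 and S28: SFI_2=TRUE for this (VL, n)
                vl_array.append(vl)        # S29: register the current VL
                break                      # move on to the next vector length
    return vl_array                        # empty list corresponds to YES in S31


# Toy predicate (assumption): pretend SFI_2 arises whenever n does not divide VL.
print(scan_vector_lengths(lambda vl, n: vl % n != 0))
```

An empty result corresponds to the case in which the loop is excluded from the loops to be optimized (S32).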

Determination S40 of Relationship between Vector Store Command and Vector Load Command

FIG. 29 and FIG. 30 are diagrams illustrating flow charts of the processing S40 of determining a relationship between a vector store instruction and a vector load instruction in the same loop and in preceding and subsequent loops. The processor that executes the compiler determines whether there is a store instruction after a load instruction in the same loop (S41). When the determination results in true (TRUE in S41), the processor executes the processing S42 to S45. When the determination results in false (FALSE in S41), the processor executes the processing of FIG. 30.

When the determination of S41 results in TRUE, the processor first determines whether the addresses of the number of elements commensurate with a vector length with respect to the first element of a vector store instruction vstore1 of a former loop and the addresses of the vector length elements of a vector load instruction vload2 of a latter loop overlap with each other between loops in loop structure (S42). When the determination S42 results in FALSE, SFI_1 does not occur, and SFI_2 also does not occur. Thus, the processor records the fact that SFI_2 is FALSE into the loop structure (S45). A relationship between the vector store instruction vstore1 in the first iteration and a vector load instruction vload1 in the second iteration is indicated above the block of S45.

When the determination S42 results in TRUE, the processor next determines whether the addresses of elements to be accessed by a vector store instruction in a former loop based on the mask register and the addresses of elements to be accessed by a vector load instruction in a latter loop based on the mask register overlap with each other between loops in loop structure (S43). When the determination S43 results in FALSE, SFI_2 may occur, and thus the processor records the fact that SFI_2 is TRUE and SFI_2_type=0 into the loop structure (S44). When the determination S43 results in TRUE, SFI_1 occurs, and SFI_2 does not occur. Thus, the processor records the fact that SFI_2 is FALSE into the loop structure (S46). A relationship between the vector store instruction vstore1 in the first iteration (the preceding loop) and a vector load instruction vload1 in the second iteration (the subsequent loop) is indicated below the blocks of S44 and S46.

As illustrated in FIG. 30, when the determination of S41 is FALSE, the processor first determines whether the addresses of the vector length elements of a vector store instruction vstore1 of a former loop and the addresses of the vector length elements of vector load instructions vload1 or vload2 of the same former loop or latter loop overlap with each other (S52). That is, the processor determines whether the addresses of the vector length elements overlap with each other between the former vector store instruction vstore1 and the latter vector load instruction vload1 in the same loop. When the determination S52 results in FALSE, SFI_1 does not occur, and SFI_2 also does not occur. Thus, the processor records the fact that SFI_2 is FALSE into the loop structure (S55). A relationship between the former vector store instruction vstore1 and the latter vector load instruction vload1 in the same loop is indicated above the block of S55.

When the determination S52 results in TRUE, the processor next determines whether the addresses of elements to be accessed by a vector store instruction vstore1 in a former loop based on the mask register and the addresses of elements to be accessed by vector load instructions vload1 or vload2 in the same former loop or a latter loop based on the mask registers overlap with each other (S53). That is, the processor determines whether the addresses of elements to be accessed by the former vector store instruction vstore1 based on the mask register and the addresses of elements to be accessed by the latter vector load instruction vload1 based on the mask register overlap with each other in the same loop. When the determination S53 results in FALSE, SFI_2 may occur, and thus the processor records the fact that SFI_2 is TRUE and SFI_2_type=1 into the loop structure (S54). When the determination S53 results in TRUE, SFI_1 occurs, and SFI_2 does not occur. Thus, the processor records the fact that SFI_2 is FALSE into the loop structure (S56). A relationship between the former vector store instruction vstore1 and the latter vector load instruction vload1 in the same loop is indicated below the blocks of S54 and S56.
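The two-stage determination (S42 and S43, or S52 and S53) can be sketched with address sets. The model below is a simplifying assumption: addresses are element indices rather than bytes, each instruction accesses VL consecutive elements from its base, and a mask is a list of 0/1 flags of length VL. The function name determine_sfi_2 is hypothetical.

```python
def determine_sfi_2(store_base, load_base, vl, store_mask, load_mask):
    """Return True when SFI_2 may occur for one vector store / vector load pair."""
    # Stage 1 (S42/S52): do the full VL-element address ranges overlap?
    store_full = set(range(store_base, store_base + vl))
    load_full = set(range(load_base, load_base + vl))
    if not (store_full & load_full):
        return False                 # neither SFI_1 nor SFI_2 occurs

    # Stage 2 (S43/S53): do the elements actually accessed under the masks overlap?
    store_active = {store_base + k for k in range(vl) if store_mask[k]}
    load_active = {load_base + k for k in range(vl) if load_mask[k]}
    if store_active & load_active:
        return False                 # true overlap: SFI_1 occurs, not SFI_2
    return True                      # ranges overlap but masked accesses do not: SFI_2


# VL=4, n=3: the store covers elements 0..3 but only writes 0..2; the load starts at 3.
mask = [1, 1, 1, 0]
print(determine_sfi_2(0, 3, 4, mask, mask))
```

In the example the full ranges share element 3 while the masked accesses are disjoint, which is the situation recorded as SFI_2=TRUE.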

As described above, the processor executes the optimization processing of the compiler in this embodiment, and generates an assembly code or object code. With this, the processor that executes the compiler can optimize the intermediate code such that the processor that executes an object code can avoid determination of occurrence of SFI_2 in the loop structure. Furthermore, even when the array size n is a variable and the vector length VL is unknown in the loop structure, the processor can optimize the intermediate code such that the processor avoids determination of SFI_2 for the array size n and the vector length VL, which are determined when the processor executes the object code. When determination of SFI_2 can be avoided, the processor can execute load instructions by out-of-order execution.

When the array size n and the iteration count m of a loop do not match each other, namely, m<n is satisfied and consecutive addresses are not accessed, and when the array size n and the vector length VL do not have a specific ratio, the processor that executes the compiler can optimize the program so that determination of SFI_2 can be avoided. Also when the vector length VL is a constant or unknown or the array size n is a constant or a variable at the time of compiling, for instance, the processor can optimize the program such that determination of SFI_2 can be avoided in accordance with the array size n and the vector length VL, which are determined when the compiled object code is executed.

All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.

1. A non-transitory computer readable storage medium that stores therein a compiler program causing a computer to execute optimization processing for an optimization target program, the optimization processing comprising: the optimization target program including a loop including a vector store instruction and a vector load instruction for an array variable, unrolling the vector store instruction and the vector load instruction in the loop by an unrolling number of times to generate a plurality of unrolled vector store instructions and a plurality of unrolled vector load instructions, the unrolling number of times including a first unrolling number of times, which is obtained by dividing a vector length by an array size of the array variable and rounding up a remainder, or a second unrolling number of times lower than the first unrolling number of times by one; and scheduling to move an unrolled vector load instruction among the plurality of unrolled vector load instructions, which is located after a first unrolled vector store instruction that is located at first among the plurality of unrolled vector load instructions, before the first unrolled vector store instruction.
2. The compiler program according to claim 1, wherein, in the loop, the array size of the array variable is a constant and the vector length is a constant, an iteration count of the loop is lower than the array size, a ratio between the iteration count of the loop and the vector length is not a specific ratio that enables unpacking.
3. The compiler program according to claim 2, wherein the loop of the optimization target program corresponds to a secondary store fetch interlock in which memory addresses of vector length elements of a preceding vector store instruction in a plurality of iterations of the loop and memory addresses of the vector length elements of a subsequent vector load instruction in the plurality of iterations of the loop overlap with each other, but memory addresses of elements to be accessed by the preceding vector store instruction and the subsequent vector load instruction based on mask registers do not overlap with each other.
4. The compiler program according to claim 3, wherein, when the loop of the optimization target program includes a vector store instruction after a vector load instruction, the secondary store fetch interlock is a case in which (1) memory addresses of the vector length elements of the preceding vector store instruction in a first iteration of the loop and memory addresses of the vector length elements of the subsequent vector load instruction in a second iteration, which is next to the first iteration, overlap with each other, but (2) memory addresses of elements to be accessed by the preceding vector store instruction in the first iteration based on a mask register and memory addresses of elements to be accessed by the subsequent vector load instruction in the second iteration based on a mask register do not overlap with each other.
5. The compiler program according to claim 3, wherein, when the loop includes a vector load instruction after a vector store instruction, the secondary store fetch interlock is a case in which (1) memory addresses of the vector length elements of the preceding vector store instruction in a first iteration of the loop and memory addresses of the vector length elements of subsequent vector load instruction in the first iteration overlap with each other, but (2) memory addresses of elements to be accessed by the preceding vector store instruction in the first iteration based on a mask register and memory addresses of elements to be accessed by the subsequent vector load instruction in the first iteration based on a mask register do not overlap with each other.
6. The compiler program according to claim 1, wherein when the loop includes a vector store instruction after a vector load instruction, the number of times of unrolling is the first unrolling number of times, and when the loop includes a vector load instruction after a vector store instruction, the number of times of unrolling is the second unrolling number of times.
7. The compiler program according to claim 1, wherein the optimization processing further comprises converting, when an array size of the array variable is a variable and the vector length is a constant or unknown, the loop into case-specific loops each including the vector store instruction and the vector load instruction in the loop, for each of a plurality of cases of the first unrolling number of times or the second unrolling number of times, calculated based on a possible combination of the array size and the vector length, and the unrolling includes unrolling processing for each of the case-specific loops, and the scheduling includes scheduling processing for each of the case-specific loops.
8. A method of compiling that executes optimization processing for an optimization target program, the optimization processing comprising: the optimization target program including a loop including a vector store instruction and a vector load instruction for an array variable, unrolling the vector store instruction and the vector load instruction in the loop by an unrolling number of times to generate a plurality of unrolled vector store instructions and a plurality of unrolled vector load instructions, the unrolling number of times including a first unrolling number of times, which is obtained by dividing a vector length by an array size of the array variable and rounding up a remainder, or a second unrolling number of times lower than the first unrolling number of times by one; and scheduling to move an unrolled vector load instruction among the plurality of unrolled vector load instructions, which is located after a first unrolled vector store instruction that is located at first among the plurality of unrolled vector load instructions, before the first unrolled vector store instruction.
9. An information processing device comprising: a processor; a memory accessed by the processor, wherein the processor executes optimization processing for an optimization target program, the optimization target program including a loop including a vector store instruction and a vector load instruction for an array variable, and the optimization processing including: unrolling the vector store instruction and the vector load instruction in the loop by an unrolling number of times to generate a plurality of unrolled vector store instructions and a plurality of unrolled vector load instructions, the unrolling number of times including a first unrolling number of times, which is obtained by dividing a vector length by an array size of the array variable and rounding up a remainder, or a second unrolling number of times lower than the first unrolling number of times by one; and scheduling to move an unrolled vector load instruction among the plurality of unrolled vector load instructions, which is located after a first unrolled vector store instruction that is located at first among the plurality of unrolled vector load instructions, before the first unrolled vector store instruction.